Lead Scoring Project
Author: Marcelo Cruz
Feel free to contact me: https://www.linkedin.com/in/marcelo-cruz-segura
1. Problem Context
Lead scoring is the process of assigning scores to prospects based on their profile and behavioral data, in order to prioritize leads, improve close rates, and shorten buying cycles.
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.
The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads convert, but most do not.
The typical lead conversion rate at X education is around 30%.
Now, although X Education gets a lot of leads, its lead conversion rate is very poor. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’.
If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone.
A large number of leads is generated at the initial stage (the top of the funnel), but only a few come out of the bottom as paying customers. In the middle stage, the potential leads need to be nurtured well (i.e. educating them about the product, communicating regularly, etc.) in order to achieve a higher conversion rate.
Goal from a business perspective:
X Education wants to select the most promising leads, i.e. those most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
Goal from a data scientist's perspective:
Our mission is to build a lead scoring model that reaches roughly 80% precision, so that around 80% of the leads we flag actually convert. We'll use predict_proba() to estimate each lead's conversion probability and rank leads accordingly. The project also aims to surface insights and emphasize a data-driven approach for success.
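As a minimal sketch of this idea, here is how predict_proba() can turn a fitted classifier into a 0-100 lead score. The data below is synthetic: X and y are placeholders for the processed features and the Converted target, not the X Education dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the processed lead features and the Converted target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# predict_proba returns one probability per class; column 1 is P(converted)
proba = model.predict_proba(X)[:, 1]
lead_score = np.round(proba * 100).astype(int)  # 0-100 score for the sales team

# The highest-scoring leads are the 'Hot Leads' to prioritize
hot_leads = np.argsort(-lead_score)[:10]
```

Raising the probability threshold used to flag a lead trades recall for precision, which is how the ~80% precision target can be tuned later.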
| Variable | Description |
|---|---|
| Prospect ID | A unique ID with which the customer is identified. |
| Lead Number | A lead number assigned to each lead procured. |
| Lead Origin | The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc. |
| Lead Source | The source of the lead. Includes Google, Organic Search, Olark Chat, etc. |
| Do Not Email | An indicator variable where the customer selects whether or not they want to be emailed about the course. |
| Do Not Call | An indicator variable where the customer selects whether or not they want to be called about the course. |
| Converted | The target variable. Indicates whether a lead has been successfully converted or not. |
| TotalVisits | The total number of visits made by the customer on the website. |
| Total Time Spent on Website | The total time spent by the customer on the website. |
| Page Views Per Visit | Average number of pages on the website viewed during the visits. |
| Last Activity | Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc. |
| Country | The country of the customer. |
| Specialization | The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form. |
| How did you hear about X Education | The source from which the customer heard about X Education. |
| What is your current occupation | Indicates whether the customer is a student, unemployed or employed. |
| What matters most to you in choosing this course | An option selected by the customer indicating their main motive for taking this course. |
| Search | Indicates whether the customer saw an ad for the course in this medium; the same applies to the five rows below (Magazine through Digital Advertisement). |
| Magazine | |
| Newspaper Article | |
| X Education Forums | |
| Newspaper | |
| Digital Advertisement | |
| Through Recommendations | Indicates whether the customer came in through recommendations. |
| Receive More Updates About Our Courses | Indicates whether the customer chose to receive more updates about the courses. |
| Tags | Tags assigned to customers indicating the current status of the lead. |
| Lead Quality | Indicates the quality of lead based on the data and intuition of the employee who has been assigned to the lead. |
| Update me on Supply Chain Content | Indicates whether the customer wants updates on the Supply Chain Content. |
| Get updates on DM Content | Indicates whether the customer wants updates on the DM Content. |
| Lead Profile | A lead level assigned to each customer based on their profile. |
| City | The city of the customer. |
| Asymmetrique Activity Index | An index and score assigned to each customer based on their activity and their profile. |
| Asymmetrique Profile Index | |
| Asymmetrique Activity Score | |
| Asymmetrique Profile Score | |
| I agree to pay the amount through cheque | Indicates whether the customer has agreed to pay the amount through cheque or not. |
| a free copy of Mastering The Interview | Indicates whether the customer wants a free copy of 'Mastering the Interview' or not. |
| Last Notable Activity | The last notable activity performed by the student. |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
from scipy.stats import linregress, uniform
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score, precision_score, precision_recall_curve, PrecisionRecallDisplay, confusion_matrix
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
df = pd.read_csv('https://raw.githubusercontent.com/CeloCruz/LeadScoring/main/Lead%20Scoring.csv')
df.head()
| Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Last Activity | Country | Specialization | How did you hear about X Education | What is your current occupation | What matters most to you in choosing a course | Search | Magazine | Newspaper Article | X Education Forums | Newspaper | Digital Advertisement | Through Recommendations | Receive More Updates About Our Courses | Tags | Lead Quality | Update me on Supply Chain Content | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.0 | 0 | 0.0 | Page Visited on Website | NaN | Select | Select | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Interested in other courses | Low in Relevance | No | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No | No | 0 | 5.0 | 674 | 2.5 | Email Opened | India | Select | Select | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Ringing | NaN | No | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No | No | 1 | 2.0 | 1532 | 2.0 | Email Opened | India | Business Administration | Select | Student | Better Career Prospects | No | No | No | No | No | No | No | No | Will revert after reading the email | Might be | No | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 20.0 | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No | No | 0 | 1.0 | 305 | 1.0 | Unreachable | India | Media and Advertising | Word Of Mouth | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Ringing | Not Sure | No | No | Select | Mumbai | 02.Medium | 01.High | 13.0 | 17.0 | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | No | No | 1 | 2.0 | 1428 | 1.0 | Converted to Lead | India | Select | Other | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Will revert after reading the email | Might be | No | No | Select | Mumbai | 02.Medium | 01.High | 15.0 | 18.0 | No | No | Modified |
Shape and info about the dataset
df.shape
(9240, 37)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9240 entries, 0 to 9239 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Prospect ID 9240 non-null object 1 Lead Number 9240 non-null int64 2 Lead Origin 9240 non-null object 3 Lead Source 9204 non-null object 4 Do Not Email 9240 non-null object 5 Do Not Call 9240 non-null object 6 Converted 9240 non-null int64 7 TotalVisits 9103 non-null float64 8 Total Time Spent on Website 9240 non-null int64 9 Page Views Per Visit 9103 non-null float64 10 Last Activity 9137 non-null object 11 Country 6779 non-null object 12 Specialization 7802 non-null object 13 How did you hear about X Education 7033 non-null object 14 What is your current occupation 6550 non-null object 15 What matters most to you in choosing a course 6531 non-null object 16 Search 9240 non-null object 17 Magazine 9240 non-null object 18 Newspaper Article 9240 non-null object 19 X Education Forums 9240 non-null object 20 Newspaper 9240 non-null object 21 Digital Advertisement 9240 non-null object 22 Through Recommendations 9240 non-null object 23 Receive More Updates About Our Courses 9240 non-null object 24 Tags 5887 non-null object 25 Lead Quality 4473 non-null object 26 Update me on Supply Chain Content 9240 non-null object 27 Get updates on DM Content 9240 non-null object 28 Lead Profile 6531 non-null object 29 City 7820 non-null object 30 Asymmetrique Activity Index 5022 non-null object 31 Asymmetrique Profile Index 5022 non-null object 32 Asymmetrique Activity Score 5022 non-null float64 33 Asymmetrique Profile Score 5022 non-null float64 34 I agree to pay the amount through cheque 9240 non-null object 35 A free copy of Mastering The Interview 9240 non-null object 36 Last Notable Activity 9240 non-null object dtypes: float64(4), int64(3), object(30) memory usage: 2.6+ MB
Initial thoughts and action plan:
1. Check if the columns specified below are really binary.
binary_cats = ['Do Not Email','Do Not Call','Search','Magazine','Newspaper Article',
'X Education Forums','Newspaper','Digital Advertisement','Through Recommendations',
'Receive More Updates About Our Courses', 'Update me on Supply Chain Content','Get updates on DM Content',
'I agree to pay the amount through cheque', 'A free copy of Mastering The Interview']
null_values = df[binary_cats].isnull().sum()
total = df[binary_cats].count()
yes_no = df[binary_cats].applymap(lambda x: 1 if x == 'Yes' or x == 'No' else 0).sum()
df_binary_cats = pd.DataFrame({'total': total,
'null_%': null_values/total*100,
'yes/no_%': yes_no/total*100})
df_binary_cats
| total | null_% | yes/no_% | |
|---|---|---|---|
| Do Not Email | 9240 | 0.0 | 100.0 |
| Do Not Call | 9240 | 0.0 | 100.0 |
| Search | 9240 | 0.0 | 100.0 |
| Magazine | 9240 | 0.0 | 100.0 |
| Newspaper Article | 9240 | 0.0 | 100.0 |
| X Education Forums | 9240 | 0.0 | 100.0 |
| Newspaper | 9240 | 0.0 | 100.0 |
| Digital Advertisement | 9240 | 0.0 | 100.0 |
| Through Recommendations | 9240 | 0.0 | 100.0 |
| Receive More Updates About Our Courses | 9240 | 0.0 | 100.0 |
| Update me on Supply Chain Content | 9240 | 0.0 | 100.0 |
| Get updates on DM Content | 9240 | 0.0 | 100.0 |
| I agree to pay the amount through cheque | 9240 | 0.0 | 100.0 |
| A free copy of Mastering The Interview | 9240 | 0.0 | 100.0 |
Let's separate the train and test sets before inspecting the data any further.
Separating train and test data is essential to avoid data leakage, evaluate model generalization, and make unbiased performance assessments in machine learning. It ensures robust model development and reliable predictions on new, unseen data.
Why stratify by target label?
Stratifying train and test datasets in classification ensures balanced class representation, guarding against biased or imbalanced model learning. It promotes accurate evaluation, preventing skewed performance metrics.
train, test = train_test_split(df, test_size=.2, random_state=12, stratify=df['Converted'])
print(f'train shape: {train.shape}')
print(f'test shape: {test.shape}')
train shape: (7392, 37) test shape: (1848, 37)
print(f'There are {train.duplicated().sum()} duplicate rows in the train set')
There are 0 duplicate rows in the train set
Check the values in Asymmetrique Index columns
train['Asymmetrique Profile Index'].value_counts(dropna=False)
NaN 3362 02.Medium 2243 01.High 1762 03.Low 25 Name: Asymmetrique Profile Index, dtype: int64
train['Asymmetrique Activity Index'].value_counts(dropna=False)
NaN 3362 02.Medium 3080 01.High 648 03.Low 302 Name: Asymmetrique Activity Index, dtype: int64
4. Data cleaning & Feature Engineering
Let us embark on our first data cleaning endeavor! Our strategy involves transforming each step into Scikit-learn transformation objects, harmonizing the entire process into a unified pipeline.
Why is it a commendable practice to conduct all preprocessing tasks using Scikit-learn?
By encapsulating each step into transformation objects, we nurture modularity and reusability. This seamless integration in pipelines ensures consistent application to both training and test datasets, simplifying model selection and tuning while optimizing efficiency and scalability. Ultimately, this fosters a standardized and maintainable machine learning workflow.
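To illustrate the idea, here is a hedged sketch of how cleaning steps wrapped in FunctionTransformer can be chained with make_pipeline. The functions drop_ids and snake_case_columns are simplified, illustrative stand-ins, not the notebook's actual steps.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Illustrative cleaning steps (hypothetical, simplified versions of the notebook's)
def drop_ids(df):
    return df.drop(columns=['Prospect ID', 'Lead Number'])

def snake_case_columns(df):
    df = df.copy()
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    return df

# Each function becomes a transformer, so the whole sequence is one reusable pipeline
preprocess = make_pipeline(FunctionTransformer(drop_ids),
                           FunctionTransformer(snake_case_columns))

demo = pd.DataFrame({'Prospect ID': ['a1'], 'Lead Number': [660737], 'Lead Origin': ['API']})
out = preprocess.fit_transform(demo)
```

The same fitted pipeline can later be applied to the test set, guaranteeing identical preprocessing on both splits.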
def data_cleaning(df):
    """Apply the data cleaning steps specified at the
    beginning of the notebook"""
    # Drop the ID columns
    df = df.drop(['Prospect ID','Lead Number'], axis=1)
    # Map the Asymmetrique index labels to an ordinal scale: High=3, Medium=2, Low=1
    index_map = {'01.High': 3.0, '02.Medium': 2.0, '03.Low': 1.0}
    df['Asymmetrique Activity Index'] = df['Asymmetrique Activity Index'].map(index_map)
    df['Asymmetrique Profile Index'] = df['Asymmetrique Profile Index'].map(index_map)
    # Binary encoding: 'No' -> 0, 'Yes' -> 1
    df[binary_cats] = df[binary_cats].applymap(lambda x: 0 if x == 'No' else 1)
    # Rename columns for convenience
    df.columns = df.columns.str.replace(' ','_').str.lower()
    return df
# Convert custom function into transformer
initial_clean = FunctionTransformer(data_cleaning)
train_clean = initial_clean.fit_transform(train);
In this stage, we'll first inspect the categorical columns from a practical and business-oriented perspective, before delving into more advanced statistical analysis.
I firmly believe that simplicity often holds the key to effective solutions.
The goal is to take a first look through all categorical columns to do some feature engineering, extract initial thoughts for future EDA/feature engineering, and decide how to handle missing values.
To keep the notebook short, I omitted these outputs. Feel free to download the code and check them yourself!
train_clean.lead_origin.value_counts(dropna=False);
train_clean.lead_source.value_counts(dropna=False);
train_clean.last_activity.value_counts(dropna=False);
train_clean.country.value_counts(dropna=False);
train_clean.specialization.value_counts(dropna=False);
train_clean.how_did_you_hear_about_x_education.value_counts(dropna=False);
train_clean.what_is_your_current_occupation.value_counts(dropna=False);
train_clean.what_matters_most_to_you_in_choosing_a_course.value_counts(dropna=False);
train_clean.tags.value_counts(dropna=False);
train_clean.lead_quality.value_counts(dropna=False);
train_clean.lead_profile.value_counts(dropna=False);
train_clean.city.value_counts(dropna=False);
train_clean.last_notable_activity.value_counts(dropna=False);
Apply initial changes described in the previous insights.
def initial_feature_engineering(df):
    """Apply the feature engineering suggested by the category inspection above"""
    # lead_source: consolidate near-duplicate sources
    df['lead_source'] = df['lead_source'].str.replace('google|Pay per Click Ads', 'Google', regex=True)
    df['lead_source'] = df['lead_source'].apply(lambda x: 'Referral Sites' if 'blog' in str(x) else x)
    df['lead_source'] = df['lead_source'].str.replace('Live Chat', 'Olark Chat')
    df['lead_source'] = df['lead_source'].str.replace('bing', 'Organic Search')
    # Group sources outside the train set's top 8 into 'Other'
    top_sources = train_clean.lead_source.value_counts()[:8].index
    df['lead_source'] = df['lead_source'].apply(lambda x: x if str(x) in top_sources else 'Other')
    # last_activity and last_notable_activity: merge equivalent levels
    activity = ['last_activity','last_notable_activity']
    df[activity] = df[activity].apply(lambda x: x.str.replace('Email Received|SMS Sent', 'SMS/Email Sent', regex=True))
    df[activity] = df[activity].apply(lambda x: x.str.replace('Email Marked Spam|Email Bounced|Unsubscribed', 'Not interested in email', regex=True))
    df[activity] = df[activity].apply(lambda x: x.str.replace('Resubscribed to emails', 'Email Opened'))
    df[activity] = df[activity].apply(lambda x: x.str.replace('Visited Booth in Tradeshow|View in browser link Clicked', 'Page Visited on Website', regex=True))
    # country: treat uninformative values as missing
    df['country'] = df['country'].apply(lambda x: np.nan if x in ['Unknown','unknown','Asia/Pacific Region'] else x)
    # specialization: merge overlapping domains, flag the unanswered 'Select' level
    df['specialization'] = df['specialization'].str.replace('E-COMMERCE|E-Business', 'E-commerce', regex=True)
    df['specialization'] = df['specialization'].str.replace('Banking, Investment And Insurance', 'Finance Management')
    df['specialization'] = df['specialization'].str.replace('Media and Advertising', 'Marketing Management')
    df['specialization'] = df['specialization'].str.replace('Select', 'Not Provided')
    # how_did_you_hear_about_x_education
    df['how_did_you_hear_about_x_education'] = df['how_did_you_hear_about_x_education'].str.replace('Select', 'Not Provided')
    df['how_did_you_hear_about_x_education'] = df['how_did_you_hear_about_x_education'].str.replace('SMS|Email', 'SMS/Email', regex=True)
    # what_matters_most_to_you_in_choosing_a_course
    df['what_matters_most_to_you_in_choosing_a_course'] = df['what_matters_most_to_you_in_choosing_a_course'].str.replace('Flexibility & Convenience|Other', 'Better Career Prospects', regex=True)
    # lead_profile
    df['lead_profile'] = df['lead_profile'].str.replace('Select', 'Not Assigned')
    # city
    df['city'] = df['city'].str.replace('Select', 'Not Provided')
    return df
initial_feature_engineering = FunctionTransformer(initial_feature_engineering)
train_clean = initial_feature_engineering.fit_transform(train_clean);
Copy of the dataset and visualizations style
train_ = train_clean.copy()
# Set style for better visualizations
train_eda = train.copy()
sns.set_style('dark')
sns.set(rc={'axes.grid':False})
sns.set_palette('viridis')
null_ = pd.DataFrame()
null_['proportion'] = np.round(train_clean.isnull().sum()/len(train_clean),4) * 100
null_['amount'] = train_clean.isnull().sum()
# Show only those columns with at least 1 missing value
null_.sort_values(by='proportion', ascending=False)[null_.amount > 0]
| proportion | amount | |
|---|---|---|
| lead_quality | 51.35 | 3796 |
| asymmetrique_activity_index | 45.48 | 3362 |
| asymmetrique_profile_score | 45.48 | 3362 |
| asymmetrique_profile_index | 45.48 | 3362 |
| asymmetrique_activity_score | 45.48 | 3362 |
| tags | 36.35 | 2687 |
| lead_profile | 29.40 | 2173 |
| what_matters_most_to_you_in_choosing_a_course | 29.40 | 2173 |
| what_is_your_current_occupation | 29.21 | 2159 |
| country | 26.50 | 1959 |
| how_did_you_hear_about_x_education | 23.92 | 1768 |
| specialization | 15.61 | 1154 |
| city | 15.41 | 1139 |
| page_views_per_visit | 1.45 | 107 |
| totalvisits | 1.45 | 107 |
| last_activity | 1.08 | 80 |
Define some plot functions
def barplot_catcols(column, width, height):
    """Plot the conversion rate for a categorical column"""
    fig, ax = plt.subplots(figsize=(width, height))
    ax = sns.barplot(data=train_.fillna('NaN'), x='converted', y=column,
                     order=order(train_.fillna('NaN'), column),
                     orient='h', palette='viridis',
                     seed=2)
    plt.title(f'Conversion Rate by {column.replace("_"," ").title()}', loc='left', size=18)
    return ax

def order(df, x, y=None):
    """Return the levels of x sorted by the mean of y (defaults to the conversion rate)"""
    if y is not None:
        return df.groupby(x)[y].mean().sort_values(ascending=False).index
    else:
        return df.groupby(x)['converted'].mean().sort_values(ascending=False).index
# Number of missing values in each row
train_['amount_missing'] = train_.isnull().sum(1)
# Plot the relation between amount missing and conversion rate
fig, ax = plt.subplots(figsize=(8,5))
ax = sns.barplot(data=train_.fillna('NaN'), x='converted', y='amount_missing',
orient='h', palette='viridis',
seed=2)
plt.title(f'Conversion Rate by Amount Missing', loc='left', size=20)
plt.show()
fig, ax = plt.subplots(figsize=(8,2))
ax = sns.barplot(data=train_, x='amount_missing', y='converted',
orient='h', palette=sns.color_palette('viridis',2),
seed=2)
plt.title(f'Amount missing by leads conversion', loc='left', size=18)
plt.show()
correlations = train_.corr(numeric_only=True)['converted'].sort_values(ascending=False)  # numeric_only skips object columns (required in recent pandas)
plt.figure(figsize=(8, 8))
correlations[1:].plot(kind='barh',
color=sns.color_palette('viridis', len(correlations)))
plt.title('Correlation with the target variable', fontsize=20)
plt.xlabel('Correlation')
plt.ylabel('Features')
plt.show()
print(f'Duplicate rows in the original dataset: {train.duplicated().sum()}')
print(f'Duplicate rows after feature engineering: {train_clean.duplicated().sum()}')
Duplicate rows in the original dataset: 0 Duplicate rows after feature engineering: 984
6. Exploratory Data Analysis
Considering the prevalence of categorical or binary variables, we'll treat "NaN" values as a distinct category for comparison. For numerical columns with few "NaN" values, we'll exclude them to ensure robust analysis. This follows EDA best practices for gaining valuable insights from the dataset.
count = train_['converted'].value_counts()
fig, ax = plt.subplots(figsize=(10, 5))
ax.pie(count, labels=count.index, autopct='%1.1f%%', startangle=90, colors=['#29568CFF', '#3CBB75FF'])
ax.set_title('Converted', size=20)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
plt.axis('equal')
plt.show()
train_.loc[:,'asymmetrique_activity_index':'asymmetrique_profile_score'].corr().style.background_gradient(cmap='vlag_r')
| asymmetrique_activity_index | asymmetrique_profile_index | asymmetrique_activity_score | asymmetrique_profile_score | |
|---|---|---|---|---|
| asymmetrique_activity_index | 1.000000 | -0.145399 | 0.855985 | -0.122669 |
| asymmetrique_profile_index | -0.145399 | 1.000000 | -0.145366 | 0.883177 |
| asymmetrique_activity_score | 0.855985 | -0.145366 | 1.000000 | -0.114636 |
| asymmetrique_profile_score | -0.122669 | 0.883177 | -0.114636 | 1.000000 |
fig, ax = plt.subplots(1,2, figsize=(12,6), sharey=True)
sns.barplot(data=train_.fillna('NaN'), x='lead_profile', y='converted',
palette='viridis', order=order(train_.fillna('NaN'),'lead_profile'),
seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Lead Profile', loc='left', size=16)
sns.barplot(data=train_.fillna('NaN'), x='asymmetrique_profile_score', y='converted',
palette='viridis', order=order(train_.fillna('NaN'),'asymmetrique_profile_score'),
seed=2, ax=ax[1])
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by Asymmetrique Profile Score', loc='left', size=16)
plt.tight_layout()
plt.show()
Correlation between the activity track record (web-related columns) and the activity/profile scores
activity_columns = ['totalvisits','total_time_spent_on_website','page_views_per_visit',
'asymmetrique_profile_score','asymmetrique_activity_score']
train_[activity_columns].corr().style.background_gradient(cmap='vlag_r')
| totalvisits | total_time_spent_on_website | page_views_per_visit | asymmetrique_profile_score | asymmetrique_activity_score | |
|---|---|---|---|---|---|
| totalvisits | 1.000000 | 0.261952 | 0.598883 | 0.129016 | -0.061397 |
| total_time_spent_on_website | 0.261952 | 1.000000 | 0.323684 | 0.167992 | -0.066008 |
| page_views_per_visit | 0.598883 | 0.323684 | 1.000000 | 0.165945 | -0.171264 |
| asymmetrique_profile_score | 0.129016 | 0.167992 | 0.165945 | 1.000000 | -0.114636 |
| asymmetrique_activity_score | -0.061397 | -0.066008 | -0.171264 | -0.114636 | 1.000000 |
Does having both the last activity and last notable activity columns provide more information?
fig, ax = plt.subplots(1,2, figsize=(12,6), sharey=True)
sns.barplot(data=train_.fillna('NaN'), x='last_activity', y='converted',
order=order(train_.fillna('NaN'),'last_activity'),
palette='viridis',
seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Last Activity', loc='left', size=16)
sns.barplot(data=train_.fillna('NaN'), x='last_notable_activity', y='converted',
            order=order(train_.fillna('NaN'),'last_notable_activity'),
            palette='viridis', seed=2, ax=ax[1])
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by Last Notable Activity', loc='left', size=16)
plt.tight_layout()
plt.show()
barplot_catcols('lead_quality',8,3)
plt.show()
fig, ax = plt.subplots(figsize=(13,4))
sns.barplot(data=train_.fillna('NaN'), x='tags', y='converted',
order=order(train_.fillna('NaN'),'tags'),
palette='viridis',
seed=2)
plt.xticks(rotation=90)
plt.title(f'Conversion Rate by Tags', loc='left', size=20)
plt.show()
fig, ax = plt.subplots(1,2, figsize=(14,7), sharey=True)
sns.barplot(data=train_.fillna('NaN'), x='specialization', y='converted',
order=order(train_.fillna('NaN'),'specialization'),
palette='viridis',
seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Specialization', loc='left', size=16)
sns.barplot(data=train_.fillna('NaN'), x='what_is_your_current_occupation', y='converted',
            order=order(train_.fillna('NaN'),'what_is_your_current_occupation'),
            palette='viridis', seed=2, ax=ax[1])
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by Occupation', loc='left', size=16)
plt.tight_layout()
plt.show()
Number of missing values per row across these two columns
train_[['what_is_your_current_occupation','specialization']].isnull().sum(1).value_counts()
0 5220 2 1141 1 1031 dtype: int64
conversion_country = train_.groupby('country')['converted'].mean()
country_count = train_['country'].value_counts().sort_index()
fig = go.Figure(data=go.Choropleth(
locations=conversion_country.index,
locationmode='country names',
z=conversion_country.values,
text=country_count.values,
colorscale='deep',
colorbar_title='Conversion Rate',
hovertemplate='%{location}<br>Conversion: %{z:.2f}<br>Count: %{text}',
))
fig.update_geos(projection_type="mercator")
fig.update_layout(
title='Conversion Rate by Country',
geo=dict(showcoastlines=True),
font=dict(size=16),
)
fig.show()
train_['country'].value_counts().sort_index()
Australia 9 Bahrain 6 Bangladesh 2 Belgium 2 Canada 3 China 2 Denmark 1 France 4 Germany 4 Ghana 2 Hong Kong 3 India 5201 Italy 2 Kenya 1 Kuwait 3 Malaysia 1 Netherlands 2 Nigeria 4 Oman 4 Philippines 1 Qatar 9 Russia 1 Saudi Arabia 17 Singapore 21 South Africa 3 Sri Lanka 1 Sweden 2 Switzerland 1 Tanzania 1 Uganda 1 United Arab Emirates 49 United Kingdom 13 United States 57 Name: country, dtype: int64
barplot_catcols('city',8,4)
plt.show()
Is the geographic data correct?
print("Cities where country isn't India:")
train_[train_['country'] != 'India'].city.value_counts(dropna=False)
Cities where country isn't India:
Not Provided 992 NaN 693 Mumbai 244 Other Cities 98 Thane & Outskirts 83 Other Cities of Maharashtra 49 Other Metro Cities 27 Tier II Cities 5 Name: city, dtype: int64
print('Countries where the city equals an Indian city:')
indian_cities = ['Mumbai','Thane & Outskirts','Other Cities of Maharashtra','Tier II Cities']
train_[train_.city.isin(indian_cities)].country.value_counts(dropna=False)
Countries where the city equals an Indian city:
India 3220 NaN 270 United States 32 United Arab Emirates 19 Singapore 11 United Kingdom 9 Saudi Arabia 8 Australia 6 Qatar 5 Bahrain 4 Germany 3 Belgium 2 Canada 2 Netherlands 2 Kuwait 1 France 1 Sweden 1 Malaysia 1 Hong Kong 1 Switzerland 1 Oman 1 China 1 Name: country, dtype: int64
fig, ax = plt.subplots(1,2, figsize=(14,7), sharey=True)
sns.barplot(data=train_.fillna('NaN'), x='lead_source', y='converted',
order=order(train_.fillna('NaN'),'lead_source'),
palette='viridis',
seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Lead Source', loc='left', size=16)
sns.barplot(data=train_.fillna('NaN'), x='how_did_you_hear_about_x_education', y='converted',
            order=order(train_.fillna('NaN'),'how_did_you_hear_about_x_education'),
            palette='viridis', seed=2, ax=ax[1])
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by How Did You Hear About It', loc='left', size=16)
plt.tight_layout()
plt.show()
train_.select_dtypes(include=['number']).nunique().sort_values()
i_agree_to_pay_the_amount_through_cheque 1 get_updates_on_dm_content 1 update_me_on_supply_chain_content 1 receive_more_updates_about_our_courses 1 magazine 1 do_not_email 2 through_recommendations 2 a_free_copy_of_mastering_the_interview 2 newspaper 2 digital_advertisement 2 newspaper_article 2 search 2 converted 2 do_not_call 2 x_education_forums 2 asymmetrique_activity_index 3 asymmetrique_profile_index 3 asymmetrique_profile_score 10 asymmetrique_activity_score 11 amount_missing 14 totalvisits 40 page_views_per_visit 103 total_time_spent_on_website 1635 dtype: int64
fig, ax = plt.subplots(3, figsize=(8,6))
sns.barplot(data=train_, x='totalvisits', y='converted',
orient='h', palette='viridis',
seed=2, ax=ax[0])
ax[0].set_title(f'Avg. Number of visits', loc='left', size=18)
sns.barplot(data=train_, x='total_time_spent_on_website', y='converted',
orient='h', palette='viridis',
seed=2, ax=ax[1])
ax[1].set_title(f'Avg. Time spent on website', loc='left', size=18)
sns.barplot(data=train_, x='page_views_per_visit', y='converted',
orient='h', palette='viridis',
seed=2, ax=ax[2])
ax[2].set_title(f'Avg. Page views per visit', loc='left', size=18)
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(3,1, figsize=(8,6))
sns.boxplot(data=train_, x='totalvisits',
ax=ax[0], palette='viridis')
ax[0].set_title('Total Visits', loc='left', size=16)
sns.boxplot(data=train_, x='total_time_spent_on_website',
ax=ax[1], palette='viridis')
ax[1].set_title('Time spent on web', loc='left', size=16)
sns.boxplot(data=train_, x='page_views_per_visit',
ax=ax[2], palette='viridis')
ax[2].set_title('Page views per visit', loc='left', size=16)
plt.tight_layout()
plt.show()
7. Data Wrangling
¶Outliers Treatment:
Addressing outliers in TotalVisits and Page Views Per Visit is essential for model performance, particularly for Logistic Regression. Capping these variables at the 95th percentile keeps the model stable and prevents inflated coefficients, and it also helps generalization in other classifiers such as Decision Trees, Random Forests, and Support Vector Machines.
Missing Values Strategy:
Numeric Columns (KNN Imputation): Using KNNImputer for Total Visits and Page Views Per Visit is preferable to median, mean, or mode imputation: KNNImputer takes feature relationships into account, preserves the data distribution, and handles multicollinearity more gracefully.
Categorical Columns (Missing Category): Treating missing values as a separate category, rather than imputing with the mode, maintains data integrity, avoids biases, and improves model reliability and accuracy, especially considering the significant difference in conversion rate between leads with missing records and others.
Let's apply all the insights discovered during EDA.
def eda_feature_engineering(df):
    # tags: collapse the many raw tags into broader categories
    # (regex=True keeps the '|'-joined patterns working in pandas >= 2.0)
    df['tags'] = df['tags'].str.replace('|'.join(['invalid number','wrong number given','number not provided']),
                                        'Not interested in calls', regex=True)
    df['tags'] = df['tags'].str.replace('|'.join(["In confusion whether part time or DLP", "Interested in Next batch",
                                                  "Shall take in the next coming month", "Still Thinking"]),
                                        "Shows certain interest", regex=True)
    df['tags'] = df['tags'].str.replace("University not recognized", "Not eligible")
    df['tags'] = df[df['tags'].notnull()].tags.apply(lambda x: 'Not eligible' if 'holder' in x else x)
    df['tags'] = df['tags'].str.replace('|'.join(["Interested in other courses", "Interested in full time MBA",
                                                  "Not doing further education"]),
                                        "Doesn't show interest", regex=True)
    df['tags'] = df['tags'].str.replace('|'.join(["Ringing","switched off"]), "Still no contact", regex=True)
    df['tags'] = df['tags'].str.replace('|'.join(["Want to take admission but has financial problems",
                                                  "Graduation in progress"]),
                                        "Not eligible for the moment", regex=True)
    df['tags'] = df[df['tags'].notnull()].tags.apply(lambda x: 'Not eligible for the moment' if 'Recognition' in x else x)
    # keep the 12 most frequent tags, group the rest under 'Other'
    df['tags'] = df[df['tags'].notnull()].tags.apply(lambda x: 'Other' if x not in df.tags.value_counts(dropna=False).index[:12] else x)
    # country and city: leads from Indian cities must have country == 'India'
    indian_cities = ['Mumbai','Thane & Outskirts','Other Cities of Maharashtra','Tier II Cities']
    df.loc[(df.country != 'India') & (df.city.isin(indian_cities)), 'country'] = 'India'
    # keep the 4 most frequent countries, group the rest under 'Other'
    df['country'] = df.loc[df['country'].notnull(),'country'].apply(
        lambda x: 'Other' if x not in df.loc[df['country'] != 'Other','country'].value_counts().index[:4] else x)
    # lead quality: a missing record means the employee was not sure
    df['lead_quality'] = df['lead_quality'].fillna('Not Sure')
    # treat the asymmetrique index columns as categorical (string) features
    df[['asymmetrique_profile_index','asymmetrique_activity_index']] = \
        df[['asymmetrique_profile_index','asymmetrique_activity_index']].astype(str)
    # drop columns with a single unique value (no predictive signal)
    drop_cols = ['magazine','receive_more_updates_about_our_courses','update_me_on_supply_chain_content',
                 'get_updates_on_dm_content','i_agree_to_pay_the_amount_through_cheque']
    df = df.drop(drop_cols, axis=1)
    # add amount_missing: the number of missing values per lead
    df['amount_missing'] = df.isnull().sum(axis=1)
    return df
eda_feature_engineering = FunctionTransformer(eda_feature_engineering)
def cap_outliers(df):
    """Replace outliers with the 95th percentile"""
    num_cols = ['totalvisits','page_views_per_visit','total_time_spent_on_website']
    for col in num_cols:
        # clip() assigns the capped values back; a bare .apply() would be discarded
        df[col] = df[col].clip(upper=df[col].quantile(.95))
    return df
cap_outliers = FunctionTransformer(cap_outliers)
We apply OneHotEncoder to all the categorical columns and StandardScaler to the numeric columns that aren't binary.
cat_columns = ['lead_origin','lead_source','country','what_is_your_current_occupation',
               'what_matters_most_to_you_in_choosing_a_course','tags','lead_quality',
               'city','last_notable_activity']
num_cols = ['totalvisits','page_views_per_visit','total_time_spent_on_website',
            'asymmetrique_activity_score','asymmetrique_profile_score','amount_missing']
impute_knn = KNNImputer(n_neighbors=5)
impute_cons = SimpleImputer(strategy='constant', fill_value='Missing')
ohe = OneHotEncoder(handle_unknown='ignore')
sc = StandardScaler()
# Make one pipeline per column type
pipe_cat = make_pipeline(impute_cons, ohe)
# scale first so KNN distances are computed on standardized features
pipe_num = make_pipeline(sc, impute_knn)
impute_scale = make_column_transformer(
(pipe_cat, cat_columns),
(pipe_num,num_cols),
remainder='drop'
)
X_train = train.drop('Converted',axis=1)
y_train = train.loc[:,'Converted']
Creating a comprehensive preprocessing pipeline for ML is essential for consistency, efficiency, and reproducibility. It prevents data leakage, simplifies scaling, and integrates hyperparameter tuning seamlessly. Such a pipeline also aids in model deployment, enhancing performance, and maintaining a reliable ML workflow.
pipe = make_pipeline(
initial_clean,
initial_feature_engineering,
eda_feature_engineering,
cap_outliers,
impute_scale
)
# Let's see how it looks
pipe
Pipeline(steps=[('functiontransformer-1',
                 FunctionTransformer(func=<function data_cleaning at 0x000001FE396D4AE0>)),
                ('functiontransformer-2',
                 FunctionTransformer(func=<function initial_feature_engineering at 0x000001FE381AB240>)),
                ('functiontransformer-3',
                 FunctionTransformer(func=<function eda_feature_engineering at 0x000001FE3BFD8540>)),
                ('functiontransformer-4',
                 FunctionTransformer(func=<function cap_outliers at 0x000001FE38373CE0>)),
                ('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='Missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['lead_origin', 'lead_source', 'country',
                                                   'what_is_your_current_occupation',
                                                   'what_matters_most_to_you_in_choosing_a_course',
                                                   'tags', 'lead_quality', 'city',
                                                   'last_notable_activity']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler()),
                                                                  ('knnimputer', KNNImputer())]),
                                                  ['totalvisits', 'page_views_per_visit',
                                                   'total_time_spent_on_website',
                                                   'asymmetrique_activity_score',
                                                   'asymmetrique_profile_score',
                                                   'amount_missing'])]))])
X_train_pp = pipe.fit_transform(X_train)
8. Modeling
¶We'll start by exploring models for potential strong performance. First, we'll evaluate them using cross-validation with stratified folds to maintain class proportions. The goal is to identify promising models before fine-tuning hyperparameters.
Display function and StratifiedKFold
# Use StratifiedKFold to shuffle the dataset while preserving class proportions in every fold
skfold = StratifiedKFold(5, shuffle=True, random_state=12)
def display_scores(model, scores, pred):
    print(f'----------- {model} -----------')
    print('')
    print("------------------ Cross validation scores:")
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
    print('')
    print("--------------- Scores in the training set:")
    print("Precision:", precision_score(y_train, pred))
    print("Recall:", recall_score(y_train, pred))
    print("F1 score:", f1_score(y_train, pred))
    print("ROC - AUC score:", roc_auc_score(y_train, pred))
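To see why stratification matters here, a tiny self-contained demo (with toy labels mirroring the ~30% conversion rate) confirms that every fold keeps the class ratio:

```python
# Toy check that StratifiedKFold preserves the class ratio in each fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 30 + [0] * 70)   # 30% positive class
X = np.zeros((100, 1))              # features are irrelevant for the split
skf = StratifiedKFold(5, shuffle=True, random_state=12)
ratios = [y[val].mean() for _, val in skf.split(X, y)]
print(ratios)  # every fold holds exactly 30% positives
```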
lr = LogisticRegression()
lr_scores = cross_val_score(lr, X_train_pp, y_train,
cv=skfold, scoring='f1')
lr.fit(X_train_pp,y_train)
lr_pred = lr.predict(X_train_pp)
# Precision and recall curve (use probabilities, not hard labels, so the curve has more than a few points)
lr_prec, lr_recall, lr_threshold = precision_recall_curve(y_train, lr.predict_proba(X_train_pp)[:,1], pos_label=lr.classes_[1])
lr_prdisplay = PrecisionRecallDisplay(precision=lr_prec, recall=lr_recall)
# Display Scores
display_scores('Logistic Regression',lr_scores,lr_pred)
----------- Logistic Regression -----------

------------------ Cross validation scores:
Scores: [0.92017937 0.91607143 0.91785714 0.92252894 0.93191866]
Mean: 0.9217111080041696
Standard deviation: 0.005547398183110531

--------------- Scores in the training set:
Precision: 0.9390642002176278
Recall: 0.9087399087399087
F1 score: 0.9236532286835533
ROC - AUC score: 0.9358799697782749
svc = SVC()
svc_scores = cross_val_score(svc, X_train_pp, y_train,
cv=skfold, scoring='f1')
svc.fit(X_train_pp, y_train)
svc_pred = svc.predict(X_train_pp)
# Precision and recall curve (SVC exposes no predict_proba by default, so use the decision function)
svc_prec, svc_recall, svc_threshold = precision_recall_curve(y_train, svc.decision_function(X_train_pp), pos_label=svc.classes_[1])
svc_prdisplay = PrecisionRecallDisplay(precision=svc_prec, recall=svc_recall)
# Display scores
display_scores('Support Vector Machine',svc_scores,svc_pred)
----------- Support Vector Machine -----------

------------------ Cross validation scores:
Scores: [0.92032229 0.92375887 0.93167702 0.9204947  0.93684211]
Mean: 0.9266189961289493
Standard deviation: 0.0065640139064668925

--------------- Scores in the training set:
Precision: 0.9438684304612084
Recall: 0.9266409266409267
F1 score: 0.9351753453772582
ROC - AUC score: 0.9460411324818105
tree = DecisionTreeClassifier(random_state = 7)
tree_scores = cross_val_score(tree, X_train_pp, y_train,
cv=skfold, scoring='f1')
tree.fit(X_train_pp, y_train)
tree_pred = tree.predict(X_train_pp)
# Precision and recall curve (probabilities give a proper curve rather than a single point)
tree_prec, tree_recall, tree_threshold = precision_recall_curve(y_train, tree.predict_proba(X_train_pp)[:,1], pos_label=tree.classes_[1])
tree_prdisplay = PrecisionRecallDisplay(precision=tree_prec, recall=tree_recall)
# Display scores
display_scores('Decision Tree',tree_scores,tree_pred)
----------- Decision Tree -----------

------------------ Cross validation scores:
Scores: [0.89533861 0.89821429 0.89938758 0.88736028 0.90394511]
Mean: 0.8968491718576317
Standard deviation: 0.005495095089980634

--------------- Scores in the training set:
Precision: 0.9912434325744308
Recall: 0.9933309933309933
F1 score: 0.9922861150070126
ROC - AUC score: 0.9939140108631633
rf = RandomForestClassifier(random_state=10,
oob_score=True)
rf_scores = cross_val_score(rf, X_train_pp, y_train,
cv=skfold, scoring='f1')
rf.fit(X_train_pp, y_train)
rf_pred = rf.predict(X_train_pp)
rf_pred_proba = rf.predict_proba(X_train_pp)
# Precision and recall curve
rf_prec, rf_recall, rf_threshold = precision_recall_curve(y_train, rf_pred_proba[:,1], pos_label=rf.classes_[1])
rf_prdisplay = PrecisionRecallDisplay(precision=rf_prec, recall=rf_recall)
# Display scores
display_scores('Random Forest',rf_scores,rf_pred)
print('Oob score: ',rf.oob_score_)
----------- Random Forest -----------

------------------ Cross validation scores:
Scores: [0.9204647  0.92086331 0.93167702 0.92537313 0.9434629 ]
Mean: 0.9283682120932953
Standard deviation: 0.008562210912321892

--------------- Scores in the training set:
Precision: 0.9908995449772489
Recall: 0.9936819936819937
F1 score: 0.9922888187872415
ROC - AUC score: 0.9939794516065702
Oob score:  0.9460227272727273
xg = GradientBoostingClassifier(random_state=11)
xg_scores = cross_val_score(xg, X_train_pp, y_train,
cv=skfold, scoring='f1')
xg.fit(X_train_pp, y_train)
xg_pred = xg.predict(X_train_pp)
# Precision and recall curve (probabilities give a proper curve rather than a single point)
xg_prec, xg_recall, xg_threshold = precision_recall_curve(y_train, xg.predict_proba(X_train_pp)[:,1], pos_label=xg.classes_[1])
xg_prdisplay = PrecisionRecallDisplay(precision=xg_prec, recall=xg_recall)
# Display scores
display_scores('Gradient Boosting',xg_scores,xg_pred)
----------- Gradient Boosting -----------

------------------ Cross validation scores:
Scores: [0.92072072 0.92558984 0.92844365 0.92665474 0.94044444]
Mean: 0.9283706783615786
Standard deviation: 0.006557141557698687

--------------- Scores in the training set:
Precision: 0.9537953795379538
Recall: 0.9129519129519129
F1 score: 0.9329268292682927
ROC - AUC score: 0.9426084680321968
fig, ax = plt.subplots(figsize=(8,5))
lr_prdisplay.plot(ax=ax, label='Logistic Regression', color='blue', linewidth=2)
svc_prdisplay.plot(ax=ax, label='Support Vector Classifier', color='green', linewidth=2)
tree_prdisplay.plot(ax=ax, label='Decision Tree', color='red', linewidth=2, alpha=.9)
rf_prdisplay.plot(ax=ax, label='Random Forest', color='purple', linewidth=2, alpha=.7)
xg_prdisplay.plot(ax=ax, label='Gradient Boosting', color='orange', linewidth=2, alpha=.5)
plt.title('Precision Recall Curve (training data)', size=16, loc='left')
plt.show()
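A natural next use of these curves is picking the lowest probability threshold that still meets the ~80% precision target. The sketch below does this on toy arrays; in practice you would pass y_train and a fitted model's predict_proba output instead:

```python
# Pick the smallest threshold whose precision meets a target.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])       # toy labels
proba = np.array([.1, .2, .35, .4, .6, .7, .8, .9])  # toy probabilities

prec, rec, thr = precision_recall_curve(y_true, proba)
ok = prec[:-1] >= 0.80        # thresholds align with prec[:-1]
threshold = thr[ok][0]        # lowest threshold with precision >= 0.80
print(threshold)
```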
lr_params = [
{'C': uniform(loc=0, scale=4),
'penalty': ['l1','l2'],
'solver': ['liblinear','saga']}
]
lr_randomcv = RandomizedSearchCV(lr, lr_params, cv=skfold,
scoring='f1',
return_train_score = True,
random_state = 10,
n_iter=100)
lr_randomcv.fit(X_train_pp, y_train)
print("---------------- Logistic Regression ---------------")
print("Best Parameters: ", lr_randomcv.best_params_)
print("Best Score: ", lr_randomcv.best_score_)
---------------- Logistic Regression ---------------
Best Parameters: {'C': 1.4933630402058768, 'penalty': 'l1', 'solver': 'liblinear'}
Best Score: 0.9225907032920444
rf_params = [{
'n_estimators': np.arange(50,500,50),
'criterion': ['gini','entropy','log_loss'],  # 'log_loss' is the valid criterion name
'max_depth': np.arange(2,14,2),
'max_features': ['sqrt','log2',None, 0.5],
}]
rf_randomcv = RandomizedSearchCV(rf, rf_params, cv=skfold,
scoring='f1',
return_train_score = True,
random_state = 10,
n_iter=100)
rf_randomcv.fit(X_train_pp, y_train)
print("----------------- Random Forest ----------------")
print("Best Parameters: ", rf_randomcv.best_params_)
print("Best Score: ", rf_randomcv.best_score_)
xg_params = [{
'n_estimators': np.arange(50,500,50),
'loss': ['exponential','log_loss'],
'max_depth': np.arange(2,14,2),
'criterion': ['friedman_mse', 'squared_error'],
'learning_rate': uniform(loc=0,scale=.5),
'max_features': ['sqrt', 'log2', None, 0.5]
}]
xg_randomcv = RandomizedSearchCV(xg, xg_params, cv=skfold,
scoring='f1',
return_train_score = True,
random_state = 10,
n_iter=50)
xg_randomcv.fit(X_train_pp, y_train)
print("--------------- Gradient Boosting --------------")
print("Best Parameters: ", xg_randomcv.best_params_)
print("Best Score: ", xg_randomcv.best_score_)
10. Make our predictions
¶At this point, we've already split the data into training and test sets, fitted the full preprocessing pipeline on the training set only, and selected and tuned our candidate models.
By following these steps, we ensure that the test data is treated as if it were completely new. Now, we're all set to apply the entire pipeline and predict the lead scores using the test dataset!
X_test = test.drop('Converted',axis=1)
y_test = test.loc[:,'Converted']
# Let's take a look at the first row
X_test.to_numpy()[:1]
# apply all the preprocessing steps to the test dataset
X_test_pp = pipe.transform(X_test)
X_test_pp.toarray()[:1]
rf_rcv_pred = rf_randomcv.predict(X_test_pp)
print("Precision:", precision_score(y_test,rf_rcv_pred))
print("Recall:", recall_score(y_test,rf_rcv_pred))
print("F1 score:", f1_score(y_test,rf_rcv_pred))
print("ROC - AUC score:", roc_auc_score(y_test,rf_rcv_pred))
rf_pred_test = rf.predict(X_test_pp)
print("Precision:", precision_score(y_test,rf_pred_test))
print("Recall:", recall_score(y_test,rf_pred_test))
print("F1 score:", f1_score(y_test,rf_pred_test))
print("ROC - AUC score:", roc_auc_score(y_test,rf_pred_test))
On closer examination, the untuned model shows a marginal F1 improvement of about 0.0006. However, as emphasized earlier, our priority is precision rather than recall. Given the negligible F1 gap, we prefer the tuned model: it gains roughly 0.0079 in precision, which aligns better with our objectives.
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
# Random Forest with hyperparameter tuning
cm1 = confusion_matrix(y_test, rf_rcv_pred)
sns.heatmap(cm1, annot=True, fmt = 'd', cmap='Greens', ax = ax[0], cbar=False)
ax[0].xaxis.set_ticklabels(['Not converted', 'Converted'])
ax[0].yaxis.set_ticklabels(['Not converted', 'Converted'])
ax[0].set_title('RF with hyperparameters tuning', loc='left')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('True')
# Random Forest without tuning
cm2 = confusion_matrix(y_test, rf_pred_test)
sns.heatmap(cm2, annot=True, fmt='d', cmap='Blues', ax=ax[1], cbar=False)
ax[1].xaxis.set_ticklabels(['Not converted', 'Converted'])
ax[1].yaxis.set_ticklabels(['Not converted', 'Converted'])
ax[1].set_title('RF without hyperparameters tuning', loc='left')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('True')
plt.tight_layout()
plt.show()
Class predictions on the left, and probabilities of converting into a customer on the right.
lead_scoring = rf_randomcv.predict_proba(X_test_pp)[:,1]
lead_prediction = rf_rcv_pred
results = np.round(np.c_[lead_prediction,lead_scoring],2)
# Let's take a look at the first 10 rows
results[:10]
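As an illustrative follow-up, these probabilities can be translated into the "Hot Leads" buckets the sales team would act on. The 0.8 / 0.5 cut-offs below are assumptions for the sketch, not values tuned in this project:

```python
# Bucket conversion probabilities into sales-friendly labels.
import numpy as np

probs = np.array([0.95, 0.64, 0.31, 0.08])  # stand-in lead scores
buckets = np.select([probs >= 0.8, probs >= 0.5],
                    ['Hot', 'Warm'], default='Cold')
print(buckets)  # ['Hot' 'Warm' 'Cold' 'Cold']
```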
In summary, this data science project focused on building a lead-scoring model for X Education. We set out to beat an 80% precision goal and exceeded it comfortably. Along the way, we identified key factors such as phone interactions, referrals, and online engagement that strongly correlated with lead conversion, and turned them into actionable strategies.
One notable achievement was the development of an automated lead scoring algorithm that not only improved lead assessment precision but also streamlined operational efficiency. By targeting promising leads, X Education could reduce sales team costs significantly.
Our journey involved thorough data exploration, preprocessing, and model development, ensuring consistency and mitigating bias. We systematically evaluated models, with the tuned Random Forest model achieving an impressive F1 score of 0.9287 and a precision score of 0.9527 on the test dataset.
This data-driven journey provides X Education with actionable insights to enhance efficiency and revenue growth, positioning the company for a transformative phase.
If you've read until here, thank you. I hope you found this information helpful and interesting in some way. Your feedback is greatly appreciated. Best regards.